Summary#
In this section, we explored the basics of OpenTelemetry: how to instrument our applications and infrastructure, and how to export that telemetry to backend visualization and analysis tools such as Jaeger and Prometheus. We also extended the value of our metrics by adding alerting rules that proactively notify us when an application operates outside its expected behavioral parameters. With what we have learned, we will never be caught blind on a support call: we will have the data to diagnose and resolve issues in our complex systems, and better yet, we will often know about problems before our customers report them.
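The metrics side of this pipeline ultimately depends on Prometheus scraping a plain-text exposition endpoint. As a rough, hand-rolled illustration (not the chapter's actual code; the metric names, labels, and values here are made up), the text format Prometheus expects for a counter and a histogram looks like this:

```python
# Sketch: render samples in the Prometheus text exposition format by hand.
# Illustrative only -- a real service would use a Prometheus client library
# or the OpenTelemetry Prometheus exporter rather than formatting strings.

def render_counter(name, labels, value):
    """Render one sample line, e.g. name{k="v"} 42."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

def render_histogram(name, labels, bounds, counts, total_sum):
    """Render cumulative _bucket lines plus the _sum and _count samples."""
    lines, cumulative = [], 0
    for bound, count in zip(bounds, counts):
        cumulative += count
        lines.append(render_counter(f"{name}_bucket", dict(labels, le=str(bound)), cumulative))
    lines.append(render_counter(f"{name}_bucket", dict(labels, le="+Inf"), cumulative))
    lines.append(render_counter(f"{name}_sum", labels, total_sum))
    lines.append(render_counter(f"{name}_count", labels, cumulative))
    return "\n".join(lines)

print(render_counter("http_requests_total", {"job": "demo-server"}, 17))
print(render_histogram(
    "http_server_duration",
    {"exported_job": "demo-server"},
    bounds=[100000, 200000, 500000],  # upper bounds in microseconds
    counts=[40, 35, 20],              # observations per bucket
    total_sum=9_500_000,
))
```

Note that histogram buckets are cumulative (each `le` bucket includes everything below it), which is exactly what makes `rate(..._bucket[5m])` and `histogram_quantile` work on the query side.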
We also established some relatively simple metrics, traces, and alerts. With this foundation, we can implement our own traces, metrics, and alerts, empowering us and our teams to react quickly and efficiently to failures in production.
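The alerting rules we wrote lean on PromQL's `histogram_quantile`, which estimates a quantile by interpolating linearly inside the cumulative bucket that crosses the target rank. A minimal sketch of that estimation (simplified to a single series with sorted finite bounds, no `rate()` windowing; the bucket values are invented for illustration):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    interpolating linearly within the bucket that crosses the target rank.
    Simplified model of PromQL's histogram_quantile: one series, finite
    bounds, lowest bucket assumed to start at 0."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return float(bound)
            # Linear interpolation between the previous bound and this one.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return float(buckets[-1][0])

# Cumulative buckets in microseconds: 50 requests <= 100000us,
# 90 <= 250000us, 100 <= 500000us (invented sample data).
buckets = [(100_000, 50), (250_000, 90), (500_000, 100)]
p50 = histogram_quantile(0.5, buckets)
print(p50)            # 100000.0: half the observations fall at or below 100000us
print(p50 > 200_000)  # False -> a median-latency rule with a 200000us
                      # threshold would not fire for this sample
```

This is the same shape of check our `HighRequestLatency` rule performs: compute the median request duration from the histogram buckets, then fire when it exceeds the threshold.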
Quiz#
Consider the following code snippet of the demo-server.yml file:
```yaml
groups:
- name: demo-server
  rules:
  - alert: HighRequestLatency
    expr: |
      histogram_quantile(0.5, rate(http_server_duration_bucket{exported_job="demo-server"}[5m])) > 200000
    labels:
      severity: page
    annotations:
      summary: High request latency
```
What can be inferred from this?
The Prometheus query triggers when the median request latency exceeds 200,000 microseconds (0.2 seconds).
The alert is triggered with a severity label of name and an annotation summary of labels.
We can see a single group named HighRequestLatency which specifies multiple rules.
All of the above